Model comparison (done correctly) helps to choose (or weight) models that provide a good representation of the true DGP by penalizing models that overfit. This penalization is achieved mainly by assessing "fit" not on the training data set, but on a held-out test data set.
A complementary approach to avoiding overfitting is to shrink parameter estimates towards zero. This can be achieved with penalized-likelihood approaches (as in the frequentist paradigm) or by using priors that favor small parameter values over large ones. Shrinkage priors typically concentrate their mass on relatively small parameter values.
Here, relative refers to the scale on which a predictor is measured.
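As a quick illustration of this scale-dependence (a hypothetical toy example, not part of the simulations below): the same effect looks numerically large or small depending on the units of the predictor, which is why predictors are usually standardized before a common shrinkage prior is applied.

```r
# Hypothetical example: one effect, two unit systems
set.seed(1)
x_m   = runif(50, 0, 2)             # predictor measured in meters
x_km  = x_m / 1000                  # the same predictor in kilometers
y_toy = 0.5 * x_m + rnorm(50, sd = .2)

coef(lm(y_toy ~ x_m))[2]            # slope near 0.5
coef(lm(y_toy ~ x_km))[2]           # slope 1000 times larger: same effect
# after standardizing, the unit of measurement no longer matters:
coef(lm(y_toy ~ scale(x_m)))[2]
coef(lm(y_toy ~ scale(x_km)))[2]    # identical to the previous line
```

A prior like normal(0, 1) would shrink the kilometer-scale slope far more aggressively than the meter-scale slope, even though both describe the same relationship.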
To visualize how shrinkage works, we generate a small data set for which we want to estimate the mean:
set.seed(5)
# standardize the draws, then shift them so the sample mean is exactly 1
y = scale(rnorm(15)) + 1
par(mar = c(3,3,2,1), mgp = c(2,.7,0), tck = -.01)
hist(y)
We want a Bayesian estimate of the mean of y, \(\small y \sim normal(\mu, 1)\), with two different priors for \(\small \mu\): \(\small \mu \sim normal(0,\sigma_{wide})\) and \(\small \mu \sim normal(0,\sigma_{shrink})\):
par(mar = c(3,3,2,1), mgp = c(2,.7,0), tck = -.01)
curve(dnorm(x, 0, 1), -10, 10, n = 500,
      ylab = "density", xlab = "mu", col = "blue", lwd = 3)
curve(dnorm(x, 0, 3), n = 500, add = TRUE,
      col = adjustcolor("blue", alpha = .5), lwd = 3)
legend("topleft",
       lty = 1, lwd = 3,
       col = c("blue", adjustcolor("blue", alpha = .5)),
       legend = c("normal(0,1)", "normal(0,3)"),
       bty = "n")
Here, we calculate the log-likelihood and the log prior density for a grid of candidate values of the mean of y:
mus = seq(-1, 3, .01)
P = apply(
  data.frame(mu = mus), 1,
  function(mu) {
    c(
      mu = mu,
      # log P(data | mu)
      llikelihood = sum(dnorm(y, mean = mu, log = TRUE)),
      # log P(mu), mu ~ normal(0,1)
      Psd1 = dnorm(mu, 0, sd = 1, log = TRUE),
      # log P(mu), mu ~ normal(0,3)
      Psd3 = dnorm(mu, 0, sd = 3, log = TRUE)
    )
  }
)
P[,1:5]
## [,1] [,2] [,3] [,4] [,5]
## mu.mu -1.000000 -0.990000 -0.980000 -0.970000 -0.960000
## llikelihood -50.784078 -50.484828 -50.187078 -49.890828 -49.596078
## Psd1.mu -1.418939 -1.408989 -1.399139 -1.389389 -1.379739
## Psd3.mu -2.073106 -2.072001 -2.070906 -2.069823 -2.068751
Next, we calculate the unnormalized posterior densities:
post1 = exp(P["llikelihood",] + P["Psd1.mu",])  # prior sd = 1
post3 = exp(P["llikelihood",] + P["Psd3.mu",])  # prior sd = 3
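These values are proportional to, but not equal to, proper posterior densities. If normalized densities are needed (e.g., to compare curves on a common scale), the grid values can be divided by their Riemann sum. A small sketch that recomputes the quantities above so it runs on its own:

```r
# recompute y, the mu grid, and the unnormalized posterior from above
set.seed(5)
y    = scale(rnorm(15)) + 1
mus  = seq(-1, 3, .01)
llik = sapply(mus, function(m) sum(dnorm(y, mean = m, log = TRUE)))
post1 = exp(llik + dnorm(mus, 0, sd = 1, log = TRUE))

dmu = mus[2] - mus[1]                     # grid spacing
post1_norm = post1 / (sum(post1) * dmu)   # Riemann-sum normalization
sum(post1_norm) * dmu                     # integrates to 1
```

Normalization does not change where the maximum lies, so it is irrelevant for the mode comparisons below.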
Now we can show prior, likelihood, and posterior together:
par(mfrow = c(3,1), mar = c(2.75,2.75,0,.25), mgp = c(1.25,.1,0), tck = -.01)

# prior
curve(dnorm(x, 0, 1), min(mus), max(mus), n = 500,
      ylab = "P(mu)", xlab = "", col = "blue", lwd = 3)
curve(dnorm(x, 0, 3), n = 500, add = TRUE,
      col = adjustcolor("blue", alpha = .5), lwd = 3)
legend("topright",
       lty = 1, lwd = 3,
       col = c("blue", adjustcolor("blue", alpha = .5)),
       legend = c("mu ~ normal(0,1)", "mu ~ normal(0,3)"),
       bty = "n")
mtext("Prior", line = -2, adj = .05)

# likelihood
plot(mus, exp(P["llikelihood",]), type = 'l',
     xlab = "", ylab = "P(y|mu)", col = "red", lwd = 2)
mtext("Likelihood", line = -2, adj = .05)

# posterior (unnormalized)
plot(mus, post1, type = 'l',
     xlab = "mu", ylab = "P(y|mu) * P(mu)", col = "purple", lwd = 2)
lines(mus, post3, col = adjustcolor("purple", alpha = .5), lwd = 2)
abline(v = c(mus[which.max(post1)], mus[which.max(post3)]), lty = 3,
       col = c("purple", adjustcolor("purple", alpha = .5)))
legend("topright",
       lty = 1, lwd = 3,
       col = c("purple", adjustcolor("purple", alpha = .5)),
       legend = c("mu ~ normal(0,1)", "mu ~ normal(0,3)"),
       bty = "n")
mtext("Posterior", line = -2, adj = .05)
Keep in mind that the posterior is proportional to the product of prior and likelihood (or, on the log scale, to the sum of log-prior and log-likelihood).
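Because this is a conjugate normal-normal model, the amount of shrinkage can also be read off a closed form: with likelihood sd fixed at 1 and prior \(\small \mu \sim normal(0, \tau)\), the posterior mode (= mean) is \(\small \sum_i y_i / (n + 1/\tau^2)\). A quick check of the grid results (recomputing y so the snippet is self-contained):

```r
set.seed(5)
y = scale(rnorm(15)) + 1
n = length(y)

# posterior mode of mu for y ~ normal(mu, 1), mu ~ normal(0, tau):
#   sum(y) / (n + 1/tau^2)
sum(y) / (n + 1/1^2)   # tau = 1: noticeably pulled towards 0
sum(y) / (n + 1/3^2)   # tau = 3: barely shrunk
mean(y)                # maximum-likelihood estimate (no shrinkage)
```

The grid maxima marked by the dotted lines above sit at (the grid points nearest to) these closed-form values.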
We zoom in to see more clearly how the narrow prior shrinks the estimated mu towards zero.
par(mar = c(3,3,2,1), mgp = c(2,.7,0), tck = -.01)
ids = which(mus > 0 & mus < 1.5)
plot(mus[ids], post1[ids], type = 'l',
     xlab = "mu", ylab = "P(y|mu) * P(mu)", col = "purple", lwd = 2)
lines(mus[ids], post3[ids], col = adjustcolor("purple", alpha = .5), lwd = 2)
abline(v = c(mus[which.max(post1)], mus[which.max(post3)]), lty = 3,
       col = c("purple", adjustcolor("purple", alpha = .5)))
arrows(x0 = mus[which.max(post3)],
       x1 = mus[which.max(post1)],
       y0 = mean(c(max(post1), max(post3))),
       lwd = 1, length = .1)
legend("topleft",
       lty = 1, lwd = 3,
       col = c("purple", adjustcolor("purple", alpha = .5)),
       legend = c("mu ~ normal(0,1)", "mu ~ normal(0,3)"),
       bty = "n")
To show that shrinkage works, we estimate spline models with different prior standard deviations on the regression coefficients for the simulated income / well-being data introduced above.
par(mfrow = c(2,1), mar = c(2.5,2.5,.5,1), mgp = c(1.75,.7,0), tck = -.01)
N = 1000
set.seed(123456)
x = sort(rnorm(N, mean = 0))
bf.s = splines::bs(x, df = 3)          # B-spline basis with 3 basis functions
bf.b = matrix(c(1,1,1), ncol = 1)      # true basis-function weights
mu = bf.s %*% bf.b                     # true conditional mean
y = rnorm(N, mu, sd = .1)
plot(x, y)
lines(x, mu, col = "red", lwd = 2)
matplot(x, bf.s, type = 'l', lty = 1)  # the three basis functions
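For reference, the penalized-likelihood counterpart of a normal(0, τ) prior on these coefficients is ridge regression: maximizing the log-likelihood minus \(\small \lambda/2 \sum_k b_k^2\). A sketch computed directly from the basis (repeating the simulation so it runs on its own; λ = 10 is an arbitrary illustrative value, and the intercept is omitted for simplicity):

```r
set.seed(123456)
N = 1000
x = sort(rnorm(N, mean = 0))
X = splines::bs(x, df = 3)                # same basis as above
y = rnorm(N, X %*% c(1, 1, 1), sd = .1)

# closed-form ridge estimate: (X'X + lambda I)^-1 X'y
b_ols = solve(t(X) %*% X, t(X) %*% y)                 # no shrinkage
b_pen = solve(t(X) %*% X + 10 * diag(3), t(X) %*% y)  # ridge, lambda = 10
cbind(b_ols, b_pen)   # the penalized coefficients have a smaller norm
```

Larger λ corresponds to a smaller prior standard deviation, i.e., stronger shrinkage.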
The following animation shows the estimated relationships for different samples drawn from the population.
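The animation itself is not reproduced here, but its idea can be sketched statically (illustrative code, not the animation's actual source): draw repeated samples from the same population, refit the spline model to each, and overlay the estimated curves to visualize sampling variability.

```r
library(splines)

plot(NULL, xlim = c(-3.5, 3.5), ylim = c(-.3, 1.3),
     xlab = "x", ylab = "y")
for (i in 1:20) {
  x  = sort(rnorm(250))
  mu = bs(x, df = 3) %*% c(1, 1, 1)   # population curve for this sample
  y  = rnorm(250, mu, sd = .1)
  fit = lm(y ~ bs(x, df = 3))         # refit to each sample
  lines(x, fitted(fit), col = adjustcolor("blue", alpha = .3))
}
```

With shrinkage priors on the coefficients, the spread of these curves across samples would be reduced, at the price of a small bias towards flatter fits.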